6 research outputs found

    Minimally Supervised Categorization of Text with Metadata

    Full text link
    Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real problems: (1) the presence of metadata: in many domains, text is accompanied by various additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be leveraged into the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework to categorize text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines.Comment: 10 pages; Accepted to SIGIR 2020; Some typos fixe

    Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation

    Full text link
    Manually annotating datasets requires domain experts to read through many documents and carefully label them, which is often expensive. Recently, pre-trained generative language models (GLMs) have demonstrated exceptional abilities in generating text which motivates to leverage them for generative data augmentation. We improve generative data augmentation by formulating the data generation as context generation task and use question answering (QA) datasets for intermediate training. Specifically, we view QA to be more as a format than of a task and train GLMs as context generators for a given question and its respective answer. Then, we cast downstream tasks into question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLM to generate relevant contexts, which is further used as synthetic training data for their corresponding tasks. We perform extensive experiments, case studies, and ablation studies on multiple sentiment and topic classification datasets and demonstrate substantial improvements in performance in few-shot, zero-shot settings. Remarkably, on the SST-2 dataset, intermediate training on SocialIQA dataset achieves an improvement of 40% on Macro-F1 score. Through thorough analyses, we observe that QA datasets that requires high-level reasoning abilities (e.g., abstractive and common-sense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings

    LOPS: Learning Order Inspired Pseudo-Label Selection for Weakly Supervised Text Classification

    Full text link
    Weakly supervised text classification methods typically train a deep neural classifier based on pseudo-labels. The quality of pseudo-labels is crucial to final performance but they are inevitably noisy due to their heuristic nature, so selecting the correct ones has a huge potential for performance boost. One straightforward solution is to select samples based on the softmax probability scores in the neural classifier corresponding to their pseudo-labels. However, we show through our experiments that such solutions are ineffective and unstable due to the erroneously high-confidence predictions from poorly calibrated models. Recent studies on the memorization effects of deep neural models suggest that these models first memorize training samples with clean labels and then those with noisy labels. Inspired by this observation, we propose a novel pseudo-label selection method LOPS that takes learning order of samples into consideration. We hypothesize that the learning order reflects the probability of wrong annotation in terms of ranking, and therefore, propose to select the samples that are learnt earlier. LOPS can be viewed as a strong performance-boost plug-in to most of existing weakly-supervised text classification methods, as confirmed in extensive experiments on four real-world datasets
    corecore